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Abstract 

Latent tree graphical models are widely used in computational biology, 
signal and image processing, and network tomography. Here we design a 
new efficient, estimation procedure for latent tree models, including Gaus- 
sian and discrete, reversible models, that significantly improves on previous 
sample requirement bounds. Our techniques are based on a new hidden state 
estimator which is robust to inaccuracies in estimated parameters. More pre- 
cisely, we prove that latent tree models can be estimated with high probabil- 
ity in the so-called Kesten-Stigum regime with 0(log 2 n) samples. 
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1 Introduction 



Background Latent tree graphical models and other related models have been 
widely studied in mathematical statistics, machine learning, signal and image pro- 
cessing, network tomography, computational biology, and statistical physics. See 
e.g. [And58, KF09, Wil02, CCL+04, SS03, EKPSOO] and references therein. For 
instance, in phylogenetics [Fel04], one seeks to reconstruct the evolutionary his- 
tory of living organisms from molecular data extracted from modern species. The 
assumption is that molecular data consists of aligned sequences and that each po- 
sition in the sequences evolves independently according to a Markov random field 
on a tree, where the key parameters are (see Section 1.1 for formal definitions): 

• Tree. An evolutionary tree T, where the leaves are the modern species and 
each branching represents a past speciation event. 

• Rate matrix. A q x q mutation rate matrix Q, where q is the alphabet size. 
A typical alphabet arising in biology would be {A, C, G, T}. Without loss 
of generality, here we denote the alphabet by [q] = {1, . . . , q}. The (i, j)'th 
entry of Q encodes the rate at which state i mutates into state j. We normal- 
ize the matrix Q so that its spectral gap is 1. 

• Edge weights. For each edge e, we have a scalar branch length r e which 
measures the total amount of evolution along edge e. (We use edge or 
branch interchangeably.) Roughly speaking, r e is the time elapsed between 
the end points of e. (In fact the time is multiplied by an edge-dependent 
overall mutation rate because of our normalization of Q.) We also think of 
r e as the "evolutionary distance" between the end points of e. 

Other applications, including those involving Gaussian models (see Section 1.1), 
are similarly defined. Two statistical problems naturally arise in this context: 

• Tree Model Estimation (TME). Given k samples of the above process at the 
observed nodes, that is, at the leaves of the tree, estimate the topology of 
the tree as well as the edge weights. 

• Hidden State Inference (HSI). Given a fully specified tree model and a single 
sample at the observed nodes, infer the state at the (unobserved) root of the 
tree. 

In recent years, a convergence of techniques from statistical physics and theoret- 
ical computer science has provided fruitful new insights on the deep connections 
between these two problems, starting with [Mos04]. 
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Steel's Conjecture A crucial parameter in the second problem above is r + (T) = 
max e T e , the maximal edge weight in the tree. For instance, for the two-state sym- 
metric Q also known as the Ising model, it is known that there exists a critical 
parameter g£s = m such that, if r + (T) < g^ s , then it is possible to perform 
HSI (better than random; see the Section 2.5 for additional details). In contrast, if 
t + (T) > g£s> there exist trees for which HSI is impossible, that is, the correla- 
tion between the best root estimate and its true value decays exponentially in the 
depth of the tree. The regime t + (T) < g^ s is known as the Kesten-Stigum (KS) 
regime [KS66]. 

A striking and insightful conjecture of Steel postulates a deep connection be- 
tween TME and HSI [SteOl]. More specifically the conjecture states that for 
the Ising model, in the KS regime, high-probability TME may be achieved with a 

number of samples k = O (log n) . Since the number of trees on n labelled leaves is 
2 e(nio g n) 5 this 

is an optimal sample requirement up to constant factors. The proof 
of Steel's conjecture was established in [Mos04] for the Ising model on balanced 
trees and in [DMR1 la] for rate matrices on trees with discrete edge lengths. Fur- 
thermore, results of Mossel [Mos03, Mos04] show that for r + (T) > g^ s a poly- 
nomial sample requirement is needed for correct TME, a requirement achieved by 
several estimation algorithms [ESSW99a, Mos04, Mos07, GMS08, DMRllb]. 
The previous results have been extended to general reversible Q on alphabets of 
size q > 2 [Roc 10, MRS 11]. (Note that in that case a more general threshold 
Qq may be defined, although little rigorous work has been dedicated to its study. 
See [MosOl, Sly09, MRS1 1]. In this paper we consider only the KS regime.) 

Our contributions Prior results for general trees and general rate matrix Q, 
when r + (T) < g^ s , have assumed that edge weights are discretized. This as- 
sumption is required to avoid dealing with the sensitivity of root-state inference 
to inexact (that is, estimated) parameters. Here we design a new HSI procedure 
in the KS regime which is provably robust to inaccuracies in the parameters (and, 
in particular, does not rely on the discretization assumption). More precisely, we 
prove that 0(log 2 n) samples suffice to solve the TME and HSI problems in the 
KS regime without discretization. We consider two models in detail: discrete, 
reversible Markov random fields (also known as GTR models in evolutionary bi- 
ology), and Gaussian models. As far as we know, Gaussian models have not 
previously been studied in the context of the HSI phase transition. (We derive the 
critical threshold for Gaussian models in Section 2.5.) Formal statements of our 
results can be found in Section 1.2. Section 1.3 provides a sketch of the proof. 
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Further related work For further related work on sample requirements in tree 
graphical model estimation, see [ESSW99b, MR06, TAW 10, TATW11, CTAW1 1, 
TAW11,BRR10]. 



1.1 Definitions 

Trees and metrics. Let T = (V,E) be a tree with leaf set [n], where [n] = 
{1, . . . , n}. For two leaves a,b G [n], we denote by P(a, b) the set of edges on the 
unique path between a and b. For a node v G V, let N(v) be the neighbors of v. 

Definition 1 (Tree Metric) A tree metric on [n] is a positive function D : [n] x 
[n] — > (0, +oo) 5mc/? ?/zaf ?/zere exz'sfs a tree T = (V, E) with leaf set [n] and an 
edge weight function w : E — > (0, +oo) satisfying the following: for all leaves 

a,b E [n] 



In this work, we consider dyadic trees. Our techniques can be extended to com- 
plete trees of higher degree. We discuss general trees in the concluding remarks. 

Definition 2 (Balanced tree) A balanced tree is a rooted, edge-weighted, leaf- 
labeled h-level dyadic tree T = (V, E, [n],r; r) where: ft > is an integer; V is 
the set of vertices; E is the set of edges; L = [n] = {1, . . . ,n} is the set of leaves 
with n = 2 h ; r is the root; r : E — y (0, +oo) is a positive edge weight function. 
We denote by (r(a, &)) a 6 6 r n i the tree metric corresponding to the balanced tree 
T = (V, E, [n], r; r). We extend t(u, v) to all vertices u, v G V. We let BY n be 
the set of all such balanced trees on n leaves and we let BY = {BY 2 /.}/i>o- 

Markov random fields on trees. We consider Markov models on trees where 
only the leaf variables are observed. The following discrete-state model is stan- 
dard in evolutionary biology. See e.g. [SS03]. Let q > 2. Let [q] be a state set and 
7r be a distribution on [q] satisfying ir x > for all x E [q\. The q x q matrix Q is a 
rate matrix if Q xy > for all x ^ y and J2 y e[ q ] Q^v = ^> f° r a ^ x e [?]■ ^ ne rate 
matrix Q is reversible with respect to n if n x Q xy = n y Q yx , for all x,y G [q]. By 
reversibility, Q has q real eigenvalues = Ai > A 2 > ■ ■ ■ > X q . We normalize Q 
by fixing A 2 = — 1. We denote by Q q the set of all such rate matrices. 



Definition 3 (General Time-Reversible (GTR) Model) For n > 1, let 

T=(V,E, [n],r;r) 




eSP(a,b) 
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be a balanced tree. Let Q be a q x q rate matrix reversible with respect to it. 
Define the transition matrices M e = e Te ®, for all e G E. The GTR model on T 
with rate matrix Q associates a state Z v in [q] to each vertex v in V as follows: 
pick a state for the root r according to it; moving away from the root, choose a 
state for each vertex v independently according to the distribution (Mf -)j e [ g ], 
with e = (u, v) where u is the parent of v. We let GTM n g be the set of all q-state 
GTR models on n leaves. We denote GTM g = {GT~M, 2 h , q } h>0 - We denote by Z w 
the vector of states on the vertices W C V. In particular, Zm are the states at the 
leaves. We denote by T>t,q the distribution ofZ[ n ], 

GTR models encompass several special cases such as the Cavender-Farris-Neyman 
(CFN) model and the Jukes-Cantor (JC) model. 

Example 1 (g-state Symmetric Model) The g-state Symmetric model (also call- 
ed q-state Potts model) is the GTR model with q > 2 states, ix = . . . , 1/g), 
and Q = Q^ POTTS where 



Qti 



q — ifi=3 



g-POTTS _ ) ~ 



O.W. 

Q 



Note that \2(Q) — — 1. The special cases q = 2 and q = 4 are called respectively 
the CFN and JC models in the biology literature. We denote their rate matrices 

A natural generalization of the CFN model which is also included in the GTR 
framework is the Binary Asymmetric Channel. 

Example 2 (Binary Asymmetric Channel) Letting q = 2 and n = (7ri,7r 2 ), 
with Hi , 7T 2 > 0, we can take 



Q 



-7T 2 7T 2 
7Ti -7Ti 



The following transformation will be useful [MP03]. Let v be a right eigenvector 
of the GTR matrix Q corresponding to the eigenvalue —1. Map the state space to 
the real line by defining X x = vz x for all x E [n]. 

We also consider Gaussian Markov Random Fields on Trees (GMRFT). Gaus- 
sian graphical models, including Gaussian tree models, are common in statistics, 
machine learning as well as signal and image processing. See e.g. [And58, Wil02]. 
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Definition 4 (Gaussian Markov Random Field on a Tree (GMRFT)) For n > 

1, let T = (V, E, [n],r; r) be a balanced tree. A GMRFT on T is a multivariate 
Gaussian vector Xy = (X v ) ve y whose covariance matrix E = (E U t,) Uil , g y with 
inverse A = E^ 1 satisfies 

(u, v) <£ E,u ^ v A uv = 0. 

We assume that only the states at the leaves are observed. To ensure identi- 
fiability (that is, to ensure that two different sets of parameters generate different 
distributions at the leaves), we assume that all internal nodes have zero mean and 
unit variance and that all non-leaf edges correspond to a nonnegative correlation. 
Indeed shifting and scaling the states at the internal nodes does not affect the leaf 
distribution. For convenience, we extend this assumption to leaves and leaf edges. 
With the choice 

E uv = p e , u,vEV, 

e£P(u,v) 

where p e = e~ Te , for all e G E, a direct calculation shows that 



A ul) 



P(u,v) 



1-P 



TT 



(u,v) 

0, o.w. 



if{u,v) G E, 



(Note that, in computing (EA) U „ with u ^ v, the product n e 6P(M w) Pe factors out, 
where w G N(v) with (w,v) G P(u,v).) In particular, {— log |E ul) |} ul)g [ n ] is a 
tree metric. We denote by £>r,£ the distribution of 'X[ n y We let GMRFT„ be the 
set of all GMRFT models on n leaves. We denote GMRFT = {GMMFT 2h } /i>0 . 

Remark 1 Our techniques extend to cases where leaves and leaf edges have gen- 
eral means and covariances. We leave the details to the reader. 



Equivalently, in a formulation closer to that of the GTR model above, one can 
think of a GMRFT model as picking a root value according to a standard Gaussian 
distribution and running independent Ornstein-Uhlenbeck processes on the edges. 

Both the GTR and GMRFT models are globally Markov: for all disjoint sub- 
sets A, B, C of V such that B separates A and C, that is, all paths between A 
and C go through a node in B, we have that the states at A are conditionally 
independent of the states at C given the states at B. 
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1.2 Results 



Our main results are the following. We are given k i.i.d. samples from a GMRFT 
or GTR model and we seek to estimate the tree structure with failure probability 
going to as the number of leaves n goes to infinity. We also estimate edge 
weights within constant tolerance. 

Theorem 1 (Main Result: GMRFT Models) Let < / < g < +oo and denote 
by GMR¥T Lg the set of all GMRFT models on balanced trees T = (V, E, [n] , r; r) 
satisfying f < r e < g, Ve G E. Then, for all < / < g < g^ s = In \/2, the 
tree structure estimation problem on GMIRFT^' 9 can be solved with k — k log 2 n 
samples, where k = g) > is large enough. Moreover all edge weights are 
estimated within constant tolerance. 

This result is sharp as we prove the following negative results establishing the 
equivalence of the TME and HSI thresholds. 

Theorem 2IfO<f<g with g > g^ s = In \/2, then the tree structure estima- 
tion problem on GMIRFT'^ cannot, in general, be solved without at least k = n 1 
samples, where 7 = j(f, g) > 0. 

The proof of the theorem is in Section 2. 

Theorem 3 (Main Result: GTR Models) Let < / < g < +00 and denote by 
GTM.f' 9 the set of all q-state GTR models on balanced trees T = (V, E, [n],r; r) 
satisfying f < r e < g, Ve G E. Then, for all q > 2, < / < g < g^ s = In \/2, 
the tree structure estimation problem on GTR^' 9 can be solved with k = k log 2 n 
samples, where k = tz(q, f,g) > is large enough. Moreover all edge weights 
are estimated within constant tolerance. 

The proof of this theorem is similar to that of Theorem 1 . However dealing with 
unknown rate matrices requires some care and the full proof of the modified algo- 
rithm in that case can be found in Section 3. 

Remark 2 Our techniques extend to d-ary trees for general (constant) d > 2. In 
that case, the critical threshold satisfies de~ 2r = 1. We leave the details to the 
reader. 
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1.3 Proof Overview 

We give a sketch of the proof of our main result. We discuss the case of GTR 
models with known Q matrix. The unknown Q matrix and Gaussian cases are 
similar. See Sections 2 and 3 for details. Let (Zf n -,)^ =1 be i.i.d. samples from a 
GTR model on a balanced tree with n leaves. Let (Z v ) be a generic sample from 
the GTR model. 

Boosted algorithm As a starting point, our algorithm uses the reconstruction 
framework of [Mos04]. This basic "boosting" approach is twofold: 

• Initial Step. Build the first level of the tree from the samples at the leaves. 
This can be done easily by standard quartet-based techniques. (See Sec- 
tion 2.2.) 

• Main Loop. Repeat the following two steps until the tree is built: 

1. HSI. Infer hidden states at the roots of the reconstructed subtrees. 

2. One-level TME. Use the hidden state estimates from the previous step 
to build the next level of the tree using quartet-based techniques. 

The heart of the procedure is Step 1. Note that, assuming each level is correctly 
reconstructed, the HSI problem in Step 1 is performed on a known, correct topol- 
ogy. However the edge weights are unknown and need to be estimated from the 
samples at the leaves. 

This leads to the key technical issue addressed in this paper. Although HSI 
with known topology and edge weights is well understood (at least in the so-called 
Kesten-Stigum (KS) regime [MP03]), little work has considered the effect of inex- 
act parameters on hidden state estimation, with the notable exception of [Mos04] 
where a parameter-free estimator is developed for the Ising model. The issue 
was averted in prior work on GTR models by assuming that edge weights are 
discretized, allowing exact estimation [DMRlla, RoclO]. 

Quartet-based tree structure and edge weight estimation relies on the following 
distance estimator. It is natural to use a distance estimator involving the eigenvec- 
tors of Q. Let v be a second right eigenvector of the GTR matrix Q corresponding 
to the eigenvalue — 1. For a E V and i = 1, . . . , k, map the samples to the real 
line by defining X l a = z/^ . Then define 




(1) 
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It can be shown that: For all a, b G V, we have — lnE[e~ T ^ a '^] = r(a, 6). 
Note that, in our case, this estimate is only available for pairs of leaves. More- 
over, it is known that the quality of this estimate degrades quickly as r(a, b) in- 
creases [ESSW99a, Att99]. To obtain accuracy eonar distance with inverse 
polynomial failure probability requires 

k > Ci£- 2 e C2T logn (2) 

samples, where C\, C 2 are constants. We use HSI to replace the X's in (1) with 
approximations of hidden states in order to improve the accuracy of the distance 
estimator between internal nodes. 

Weighted majority For the symmetric CFN model with state space {+1,-1}, 
hidden states can be inferred using a linear combination of the states at the leaves — 
a type of weighted majority vote. A natural generalization of this linear estimator 
in the context of more general mutation matrices was studied by [MP03]. The 
estimator at the root r considered in [MP03] is of the form 

* - £ (Mi) x - (3) 

x6[n] V 7 

where ^ is a unit flow between r and [n]. For any such \T/, S r is a conditionally 
unbiased estimator of X r , that is, K[S r \ X r ] = X r . Moreover, in the KS regime, 
that is, when t + < <j£ s , one can choose a flow such that the variance of S r is 
uniformly bounded [MP03] and, in fact, we have the following stronger moment 
condition 

E[exp(C.S r )|X r ] < exp(CX r + c( 2 ) 

for all C G K [PR1 1]. In [RoclO] this estimator was used in Step 1 of the boosted 
algorithm. On a balanced tree with \ogn levels, obtaining sufficiently accurate 
estimates of the coefficients in (3) requires accuracy l/f2(log(n)) on the edge 
weights. By (2), such accuracy requires a 0(log 3 n) sequence length. Using mis- 
specified edge weights in (3) may lead to a highly biased estimate and generally 
may fail to give a good reconstruction at the root. Here we achieve accurate hidden 
state estimation using only 0(log 2 n) samples. 

Recursive estimator We propose to construct an estimator of the form (3) re- 
cursively. For x G V with children yi, y 2 we let 

S V = Uly 1 S yi + UJy 2 Sy 2 , (4) 
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and choose the coefficients u yi ,u y2 to guarantee the following conditions: 

• We have 

E[S X | Z x ] = B{x)X x , 
with a bias term B(x) close to 1. 

• The estimator satisfies the exponential moment condition 

E[exp((S x )\Z x ] <exp((X x + c( 2 ). 

We show that these conditions can be guaranteed provided the model is in the 
KS regime. To do so, the procedure measures the bias terms B(yi) and 6(1/2) 
using methods similar to distance estimation. By testing the bias and, if neces- 
sary, compensating for any previously introduced error, we can adaptively choose 
coefficients L02 so that S x satisfies these two conditions. 

Unknown rate matrix Further complications arise when the matrix Q is not 
given and has to be estimated from the data. We give a procedure for recovering 
Q and an estimate of its second right eigenvector. Problematically, any estimate v 
of v may have a small component in the direction of the first right eigenvector of 
Q. Since the latter has eigenvalue 0, its component builds up over many recursions 
and it eventually overwhelms the signal. However, we make use of the fact that the 
first right eigenvector is identically 1 : by subtracting from S x its empirical mean, 
we show that we can cancel the effect of the first eigenvector. With a careful 
analysis, this improved procedure leads to an accurate estimator. 

2 Gaussian Model 

In this section, we prove our main theorem in the Gaussian case. The proof is 
based on a new hidden state estimator which is described in Section 2.1. For 
n = 2 h with/i > 0, let T = (V,E, [n],r;r) be a balanced tree. We assume 
that < r e < g, Ve G E, with < g < g^ s = In y/2. The significance of 
the threshold g^ s ls explained in Section 2.5 where we also prove Theorem 2. 
We generate k i.i.d. samples (X^) J fc =1 from the GMRFT model 2V,£ where k = 
k log 2 n. 

Our construction is recursive, building the tree and estimating hidden states 
one level at a time. To avoid unwanted correlations, we use a fresh block of 
samples for each level. Let K = n log n be the size of each block. 
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2.1 Recursive Linear Estimator 



The main tool in our reconstruction algorithm is a new hidden state estimator. 
This estimator is recursive, that is, for a node x E V it is constructed from es- 
timators for its children y, z. In this subsection, we let X v be a generic sample 
from the GMRFT independent of everything else. We let (XL,)^ be a block of 
independent samples at the leaves. For a node u E V, we let |_^J be the leaves 
below u and X|_ u j , the corresponding state. 

Linear estimator We build a linear estimator for each of the vertices recursively 
from the leaves. Let x E V — [n] with children (direct descendants) yi, y 2 . As- 
sume that the topology of the tree rooted at x has been correctly reconstructed, as 
detailed in Section 2.2. Assume further that we have constructed linear estimators 

S u = C u (X\_ u j) 

of X u , for all u E V below x. We use the convention that C U (X\ U \ ) = X u if u is 
a leaf. We let C x be a linear combination of the form 

S X = C X {X^) = UJ yi C yi (Xl yii ) + L0y 2 £y 2 (X\_y 2 j), (5) 

where — ideally — the u/s are chosen so as to satisfy the following conditions: 

1. Unbiasedness. The estimator S x = C X (X^ X ^) is conditionally unbiased, 
that is, 

E[^ | X x ] = X x . 

2. Minimum Variance. The estimator has minimum variance amongst all 
estimators of the form (5). 

An estimator with these properties can be constructed given exact knowledge of 
the edge parameters, see Section 2.5. However, since the edge parameters can 
only be estimated with constant accuracy given the samples, we need a procedure 
that satisfies these conditions only approximately. We achieve this by 1) recur- 
sively minimizing the variance at each level and 2) at the same time measuring 
the bias and adjusting for any deviation that may have accumulated from previ- 
ously estimated branch lengths. 
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Setup We describe the basic recursive step of our construction. As above, let 
x E V — [n] with children 2/i,y 2 and corresponding edges e x = (x,yi),e 2 = 
(x, 1/2). Let < 5 < 1 (small) and c > 1 (big) be constants to be defined later. 
Assume that we have the following: 

• Estimated edge weights f e for all edges e below x such that there is e > 
with 

\f e - T e | < £. (6) 

The choice of e and the procedure to obtain these estimates are described in 
Section 2.3. We let p e = e" fe . 

• Linear estimators C u for all u E V below x such that with 

E[S U I X u ] = B(u)X u , (7) 
where S u = C U (X\_ U \), for some B(u) > with \B(u) — 1| < S and 

V{u) = Var[5 u ] < c. (8) 

Note that these conditions are satisfied at the leaves. Indeed, for u E [n] one 

has S u = X u and therefore E[S U | X u \ = X u and V{u) = Vai[X u ] = 1. We 
denote f3(u) = — \nB(u). 

We now seek to construct S x so that it in turn satisfies the same conditions. 

Remark 3 In this subsection, we are treating the estimated edge weights and 
linear estimator coefficients as deterministic. In fact, they are random variables 
depending on sample blocks used on prior recurrence levels — and in particular 
they are independent of Xy and of the block of samples used on the current level. 



Procedure Given the previous setup, we choose the weights u Va , a = 1, 2, as 
follows. For u, v E V below x and £ = 1, . . . , K let 

and define 

f(«,«) = -ln 

the estimated path length between u and v including bias. Welet/?(«) = — kiB(u). 
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1. Estimating the Biases. If yi,y 2 are leaves, we let (3(y a ) = 0, a = 1,2. 
Otherwise, let z 2 i, z 22 be the children of y 2 . We then compute 




2. Minimizing the Variance. For a = 1, 2 we set w w , u y2 as 



B(y<*)Pe, 



(9) 



which corresponds to the solution of the following optimization problem: 



The constraint in the optimization above is meant to ensure that the bias 
condition (7) is satisfied. We set 



Bias and Variance We now prove (7) and (8) recursively assuming (6) is satis- 
fied. This follows from the following propositions. 

Proposition 1 (Concentration of Internal Distance Estimates) For all e > 0, 

7>0, < 5 < 1 and c > 0, there is k = k(e, 7, 5, c) > such that, with 
probability at least 1 — 0(n~ 7 ), we have 



for all u, v G {yi,y 2 , zu, z u , z 2 \, z 22 \ where z al , z a2 are the children ofy, 
Proof: First note that 



1, w w ,w w > 0}. (10) 



f(u, v) - (t(u, v) + 0(u) + 0(v))\ < e, 



E [S U S V ] 



E[E[S U S V \X U ,X V }} 
E[E [S U \X U ]E[S V \X V \] 
E [B{u)B{v)X u X v ] 
B(u)B(v)E uv , 
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where we used the Markov property on the third line, so that 



— In E 



1 K 



(u,v) + P(u)+ P(v). 



Moreover, by assumption, S u is Gaussian with 

E[S U ] = 0, Vai[S u ] = V(u) < c, 

and similarly for u. It is well-known that in the Gaussian case empirical covari- 
ance estimates as above have x 2 -type distributions [And58]. Explicitly, note that 
from 

it suffices to consider the concentration of S^, S%, and (S u + S v ) 2 . Note that 

Var[,S u + S v ] = V{u) + V(v) + 2B{u)B{v)Y, uv < 2c + 2(1 + 5f < +oo, 

independently of n. We argue about S^, the other terms being similar. By defini- 
tion, Su/V(u) has a %\ distribution so that 



E 



2(V(u) 



< +oo, 



(11) 



for \(\ small enough, independently of n. The proposition then follows from stan- 
dard large-deviation bounds [Dur96]. ■ 

Proposition 2 (Recursive Linear Estimator: Bias) For all 5 > 0, there is e > 
small enough so that, assuming that Proposition 1 holds, 

E[S x \X x ]=B(x)X x , 

for some B(x) > with \B(x) — 1| < 5. 

Proof: We first show that the conditional biases at y 1; y 2 are accurately estimated. 
From Proposition 1, we have 



\f(z 21 ,z 22 ) - (t(z 21 ,z 22 ) + (3(z 21 ) + (3(z 22 ))\ < e, 



13 



and similarly for f(yi, z 2 \) and f(yi, z 22 ). Then from (6), we get 

2^(2/i) = f(y 1 ,z 2 i) + f(y 1 ,z 22 ) - r(z 2 i,z 22 ) - 2f ei - 2f e2 

< (r(yi, z 2l ) + p{y x ) + /3(2 21 )) + (r( yi , z 22 ) + ^(y^ + (3{z 22 )) 

-(r(z2i, ^22) + /9(z2i) + /3(^2 2 )) - 2r ei - 2r e2 + 7e 
= W{Vi) + (t^i, 221) + r(yi, z 22 ) - t(z 21 , z 22 )) - 2(r ei + r e2 ) + 7e 
= 2/3(y 1 ) + ([T(y u y 2 ) + r(y 2 , z 21 )} + [T(y u y 2 ) + r(y 2 ,z 22 )} 

-[r(z 21 ,y 2 ) + r(y 2 , z 22 )}) - 2r(y u y 2 ) + 7e 
= 2(3( yi ) + 7e, 

where we used the additivity of r on line 4. And similarly for the other direction 
so that 

\P{Vx)-P{Vi)\<\e. 

The same inequality holds for y 2 . 
Given u m , u m , the bias at x is 

^[S X I X x ] = K{UJy 1 Sy 1 + LUy 2 Sy 2 \ X x ] 

a=l,2 
a=l,2 

Q=l,2 

= (^BCj/O/Oei +Uy 2 B(y2)Pe 2 )X x 

= B(x)X x , 

where we used the Markov property on line 2 and the fact that X v is Gaussian on 
line 5. The last line is a definition. Note that by the inequality above we have 

B(x) = u yx B{y x )p ei +u y2 B{y 2 )p e2 

= u yi e~^ yi) p ei + u y2 e~^ p e2 

< u yi e-^ +7 ^(p ei +e)+ ^ 2 e-^) +7 / 2£ (p e2 + e) 

= (u yi B(yi)p ei + ujy 2 B(y 2 )p e2 ) + max{w yi ,w w }0(e) 

= 1 + max{u yi ,u y2 }0(e), 
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where the last line follows from the definition of uj ya . Taking e, 5 small enough, 
from our previous bounds and equation (9), we can derive that u Va = 0(1), a = 
1,2. In particular, B(x) = 1 + 0(e) and, choosing e small enough, it satisfies 
\B(x) - 1| < 5. ■ 

Proposition 3 (Recursive Linear Estimator: Variance) There exists c > large 
enough and e, 5 > small enough such that, assuming that Proposition 1 holds, 
we have 

V(x) = Var[S x ] < c. 

Proof: From (9), 



n2 ^2 



<+< = l l^t* ^ + f,/l 2 oV2 )( l + 0(e + 5)) 



(Pl+Pl) 2 (Pl+Pl 
(l + 0(e + 5)) 



Pt, + P 



<■■> 



for e, 5 > small enough, where p* = e~ 9 so that 2(p*) 2 > 1. Moreover, 

Var^] = Vax[ujy 1 Sy 1 + ujy 2 S y2 ] 

= w^VarfS^] + u 2 y2 Yai[Sy 2 } + u yi Uy 2 E[S yi Sy 2 } 

< (w^ + wj 2 )c + ujy 1 Uy 2 B(y l )B(y 2 )T Juv 

< «+<)c + ^n^ 2 (l + 5) 2 

< c, 

taking c large enough. ■ 



2.2 Topology reconstruction 

Propositions 2 and 3 rely on the knowing the topology below x. In this section, 
we show how this is performed inductively. That is, we assume the topology is 
known up to level < h! < h and that hidden state estimators have been derived 
up to that level. We then construct the next level of the tree. 
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Quartet Reconstruction Let L h r be the set of vertices in V at level hi from 
the leaves and let Q = {a, 6, c, d} C L ft / be a 4-tuple on level hi. The topology 
of T restricted to Q is completely characterized by a bipartition or quartet split 
q of the form: ab\cd, ac\bd or ad\bc. The most basic operation in quartet-based 
reconstruction algorithms is the inference of such quartet splits. This is done by 
performing a. four-point test: letting 

F(ab\cd) = -[t(o, c) + r(b, d) — r(a, b) — r(c, d)}, 

we have 

{ab\cd if IF (a, b\c, d) > 
ac|6d if .F(a, b\c, d) < 
acZ|6c o.w. 

Note however that we cannot estimate directly the values r(a, c), r(b, d), r(a, b), 
and r(c, d) for internal nodes, that is, when hi > 0. Instead we use the internal 
estimates described in Proposition 1. 

Deep Four-Point Test Let D > 0. We let 

T(ab\cd) = -[r(a, c) + f(b, d) — f (a, 6) — r(c, d)], 

and 

SD(S) = l{r(x,y) < D, Vx, y G 5}. 
We define the deep four-point test 

FP(a, 6|c, d) = §©({a, 6, c, d})l{^(a6|cd) > //2}. 

Algorithm. Fix 7 > 2, < e < f/4, < 5 < 1 and D = Ag + 2 ln(l + 5) + 
e. Choose c, k so as to satisfy Proposition 1. Let Z be the set of leaves. The 
algorithm is detailed in Figure 1 . 

2.3 Estimating the Edge Weights 

Propositions 2 and 3 also rely on edge-length estimates. In this section, we show 
how this estimation is performed, assuming the tree topology is known below x' E 
Ly+i and edges estimates are known below level hi . In Figure 1, this procedure 
is used as a subroutine in the tree-building algorithm. 
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Algorithm 

Input: Samples (X?,)^; 
Output: Tree; 

• For h! = 0, . . . , h - 1, 

1. Deep Four-Point Test. Let 

Tfyi' = {<? = a6|cd : Va, b,c,d£ Zw distinct such that FP(g) = 1}. 

2. Cherries. Identify the cherries in TZ^i, that is, those pairs of vertices that 
only appear on the same side of the quartet splits in TZh> ■ Let 

7ul , _ /-(fc'+i) r ( ft '+ 1 ) i 

be the parents of the cherries in Z^ 

3. Edge Weights. For all x' £ Z h i +1 , 

(a) Let y[, y' 2 be the children of x'. Let z[, z 2 be the children of y[. Let 
w' be any other vertex in Z^ with SD({zi , z' 2 , y 2 > ^'D = 1- 

(b) Let e' x be the edge between and x'. Set 

f e j = o(4, 4; 2/2)^0- 

(c) Repeat interchanging the role of and y 2 - 

Figure 1 : Tree-building algorithm. In the deep four-point test, internal distance 
estimates are used as described in Section 2.1. 

Let y' 1 ,y' 2 be the children of x' and let e[, e 2 be the corresponding edges. Let 
u/ in L^/ be a vertex not descended from x'. (One should think of w' as being on 
the same level as on a neighboring subtree.) Our goal is to estimate the weight of 
e[. Denote by z[, z 2 the children of y[. (Simply set z[ = z' 2 = y[ if y[ is a leaf.) 
Note that the internal edge of the quartet formed by z[, z' 2 , y' 2 , w' is e[. Hence, we 
use the standard four-point formula to compute the length of e[: 

f e > = b(z[, z 2 ; y' 2 , w') = i(f(4, y'2) + t(z 2 , w') - f(z[, z 2 ) - ?(y' 2 , w')), 

and p e > = e Te 'i . Note that, with this approach, the biases at z[, z' 2 , y' 2 , w' cancel 
each other. This technique was used in [DMR1 la]. 
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Proposition 4 (Edge- Weight Estimation) Consider the setup above. Assume that 
for all a, b G {z[, z' 2 , y' 2 , w'} we have 

\f(a, b) - (r(a, b) + /3(a) + /3(6))| < e/2, 

for some e > 0. 77?en, |r e ^ — vj < e. 

This result follows from a calculation similar to the proof of Proposition 2. 

2.4 Proof of Theorem 1 

We are now ready to prove Theorem 1 . 

Proof: (Theorem 1) All steps of the algorithm are completed in polynomial time 
in n and k. 

We argue about the correctness by induction on the levels. Fix 7 > 2. Take 
5 > 0,0 < e < f /A small enough and c, k large enough so that Propositions 1, 2, 
3, 4 hold. We divide the k log 2 n samples into log n blocks. 

Assume that, using the first h' sample blocks, the topology of the model has 
been correctly reconstructed and that we have edge estimates satisfying (6) up to 
level h! . Assume further that we have hidden state estimators satisfying (7) and 
(8) up to level h' - 1 (if ti > 1). 

We now use the next block of samples which is independent of everything 
used until this level. When h! = 0, we can use the samples directly in the Deep 
Four-Point Test. Otherwise, we construct a linear hidden-state estimator for all 
vertices on level h! . Propositions 2 and 3 ensure that conditions (7) and (8) hold 
for the new estimators. By Proposition 1 applied to the new estimators and our 
choice of D = Ag + 2 ln(l + 8) + e, all cherries on level h' appear in at least one 
quartet and the appropriate quartet splits are reconstructed. Note that the second 
and third terms in D account for the bias and sampling error respectively. Once the 
cherries on level h' are reconstructed, Proposition 4 ensures that the edge weight 
are estimated so as to satisfy (6). 

That concludes the induction. ■ 

2.5 Kesten-Stigum regime: Gaussian case 

In this section, we derive the critical threshold for HSI in Gaussian tree models. 
The section culminates with a proof of Theorem 2 stating that TME cannot in 
general be achieved outside the KS regime without at least polynomially many 
samples. 
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2.5.1 Definitions 

Recall that the mutual information between two random vectors Y] and Y 2 is 
defined as 

J(Y i; Y 2 ) = H{Y X ) + H(Y 2 ) - H(Y U Y 2 ), 
where H is the entropy, that is, 

H(Y 1 ) = - J /i(yi)log/i( yi )eZyi, 

assuming Yi has density fi. See e.g. [CT91]. In the Gaussian case, if Yi has 
covariance matrix Si, then 

i/(Y 1 ) = ^log(27re)" 1 |S 1 |, 

where |Si| is the determinant of the n\ x ri\ matrix Si. 
Definition 5 (Solvability) Let be a GMRFT on balanced tree 

7-W = (V^ h \E^,[n^},r^;T^), 

where = 2 h and r e W = r > for all e e E^. For convenience we denote 
the root by 0. We say that the GMRFT root state reconstruction problem with r is 
solvable if 

limmfJ^U^^O, 

that is, if the mutual information between the root state and leaf states remains 
bounded away from as the tree size goes to +oo. 

2.5.2 Threshold 

Our main result in this section is the following. 

Theorem 4 (Gaussian Solvability) The GMRFT reconstruction problem is solv- 
able if and only if 

2e~ 2r > 1. 

When 2e~ 2r < 1 then 

/(^X^H^V 1 -*-^ , (.2) 

as h — > oo. 
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Proof: Fix h > and let n = n^ h \ 



W(X*>;X<5>), 



[[n]] = {0, . . . , n}, and p = e T . Assume 2p 2 ^ 1. (The case 2p 2 = 1 follows 
by a similar argument which we omit.) Denote by E^ and E^ the covariance 

matrices of and (Xq , X- ^ ) respectively. Then 



Let e n be the all-one vector with n elements. To compute the determinants above, 
we note that each eigenvector v _L e n of E^ gives an eigenvector (0, v) of E||^ 
with the same eigenvalue. There are 2 h — 1 such eigenvectors. Further e n is 
an eigenvector of E& with positive eigenvalue corresponding to the sum of all 
pairwise correlation between a leaf and all other leaves (including itself), that is, 



(The other eigenvectors are obtained inductively by noticing that each eigenvector 
v for size 2 h ~ 1 gives eigenvectors (v, v) and (v, — v) for size 2 h .) Similarly the 
remaining two eigenvectors of Efy, are of the form (1, (3e n ) with 






(1, (3e n )' = (1 + /32V, (P h + 0Rh)e n )' = A(l, 0e n )', 



whose solution is 



(R h -l)±y/(R h -l) 2 + 4p 2h 2 
2p h 2 h 



and 



A 



± 

h 



1 + ^2 h p h . 



Moreover note that 




l + (/3+ + /3 h -)2V + /3+/3-2 2 V 
l + (R h -l)- p 2h 2 h ] 

Rk - (2p 2 ) h . 
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Hence 



Finally, 




if 2p 2 < 1, 
-ilog(4r-l), if 2p 2 > 1, 



as h — > +00 with equation (12) established by a Taylor series expansion in the 
limit. ■ 



2.5.3 Hidden state reconstruction 

We make precise the connection between solvability and hidden state estima- 
tion. We are interested in deriving good estimates of xf^ given xW. Recall 

that the conditional expectation E[Xq |X^] minimizes the mean squared error 
(MSE) [And58]. Let AS* = (Er*?) -1 - Under the Gaussian distribution, condi- 
tional on X$, the distribution of Xq is Gaussian with mean 



and covariance 



= fe-C (13) 
l-p^e n A£V„ = l-^ = e-^. (14) 



R, 

The MSE is then given by 

E[rf' - E[X< fc) |X<$>]) 2 ] = E[Var[X^|xg)]] = e^. 
Theorem 5 (Linear root-state estimation) The linear root-state estimator 

P^ L {h 
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has asymptotic MSE < 1 as h — > +00 if and only if 2e 2t > 1. (Note that 
achieving an MSE ofl is trivial with the estimator identically zero.) 

The following observation explains why the proof of our main theorem centers 
on the derivation of an unbiased estimator with finite variance. Let Xk h) be a ran- 
dom variable measurable with respect to the a-field generated by X^) . Assume 

that E,[Xq^ \X^] = Xq, that is, X$ is a conditionally unbiased estimator of 
X { h) . In particular E[X^ h) ] = 0. Then 

E[(X { h) - aX { h) ) 2 ] = E[E[(X ( h) - aX ( h) ) 2 \X { h) ]] 

= 1 - 2aE[E[X [ h) X { h) \X { h) }} + aVarflf] 
= 1 - 2a + a 2 V&r[X { h) ], 

which is minimized for a = 1/Vsix[Xq ]. The minimum MSE is then 1 — 
l/Var[X^ h) ]. Therefore: 

Theorem 6 (Unbiased root-state estimator) There exists a root-state estimator 
with MSE < 1 if and only if there exists a conditionally unbiased root-state esti- 
mator with finite variance. 

2.5 A Proof of Theorem 2 

Finally in this section we establish that when 2e~ 2r < 1 the number of samples 
needed for TME grows like n 1 proving Theorem 2. 

Proof: (Theorem 2) The proof follows the broad approach laid out in [Mos03, 
Mos04] for establishing sample size lower bounds for phylogenetic reconstruc- 
tion. Let T and T be /i-level balanced trees with common edge weight r and 
the same vertex set differing only in the quartet split between the four vertices at 
graph distance 2 from the root U = {u±, . . . , 114} (that is, the grand-children of 
the root). Let {Xy}^ =1 and {Xy}^ =1 be k i.i.d. samples from the corresponding 
GMRFT. 

Suppose that we are given the topology of the trees below level two from the 
root so that all that needs to be reconstructed is the top quartet split, that is, how 
U splits. By the Markov property and the properties of the multivariate Gaussian 
distribution, {y^} u6 t/,i6{i,...,fc} with Y£ = E[X l u \ Xf,] is a sufficient statistic for 
the topology of the top quartet, that is, it contains all the information given by 
the leaf states. Indeed, the conditional distribution of the states at U depends on 
the leaf states only through the condition expectations. To prove the impossibility 
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of TME with high probability, we will bound the total variation distance between 
Y_ = {Y u } ueU and Y_ = {Y u } ueU . We have that Y_ is a mean Gaussian vector 
and using equations (13) and (14) its covariance matrix is given by 



(E^) un = Var[F n ] = e 



-2I h 



1 - 0((2p 2 ) h ) 



and 



J lf)uu' 



Cov[Y u , X u ]Cov[X u , X u ,]Cov[X u ,, Y u ,\ 

( 2p 2)2(fe-2) 



R h-2 



J U)uu' 



0((2p 



where £[/ is the covariance matrix of Xu- The covariance matrix of Y_ is defined 
similarly. Let (resp. K v ) denote the inverse covariance matrix (E^)" 1 (resp. 
(£^) _1 ). We note that and are close to the identity matrix and, hence, so 
are their inverses [HJ85]. Indeed, with Iu the 4 x 4-identity matrix, the elements 
of T, v — Iu are all 0({2p 2 ) h ) and, similarly for S^, which implies that 



sup|A; u , -A* u ,| = 0{{2p 



2\h\ 



(15) 



We let rf TV (-, ■) denote the total variation distance of two random vectors. Note 
that by symmetry | det A^| = | det A^| and so, with fy_ (y) the density function of 
Y, the total variation distance satisfies 



dTv(Y,Y) 




fy(y) 
My) 



fy_{y)dy 



exp 



cxp 



-\f~Kjy_+\fKjy_ 



3=1 

(exp [0((2p 2 ) h yD] ~ l) frWy 
(Eexp [0((2p 2 ) h Y u \)] - 1) 




0{(2p 



2^/^^ 
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where the first inequality follows from equation (15) while the second follows 
from an application of the AM-GM inequality and fact that the Y u . are identically 
distributed. The final equality follows from an expansion of equation (11). 

It follows that when k = o((2p 2 )~ h ) we can couple {Y*} ue u,ie{i,. ..,k} an d 
{y^}«ec/,ie{i,...,fc} with probability (1 - 0((2p 2 ) h )) k which tends to 1. Since they 
form a sufficient statistic for the top quartet, this top structure of the graph cannot 
be recovered with probability approaching 1. Recalling that n = 2 h , p = e~ T 
and that if 7 < (2r)/log2 - 1 then GMMFT 7 ' 5 is not solvable with k — n 7 — 
o((2p 2 )~ h ) samples. ■ 



In this section, we prove our reconstruction in the GTR case. We only describe 
the hidden-state estimator as the other steps are the same. We use notation similar 
to Section 2. We denote the tree by T = ( V, E) with root r . The number of leaves 
is denoted by n. Let q > 2, < / < g < +00, and T = (V, E, [n],r; r) E MY f ' g . 
Fix Q E Q q . We assume that < g < g^ s = In \[2. We generate k i.i.d. samples 
(Zy)\ =l from the GTR model (T,Q) with state space [q]. Let v 2 be a second 
right eigenvector of Q, that is, an eigenvector with eigenvalue —1. We will use 
the notation X % = i/L , for all u E V and i — 1, . . . , k. We shall denote the leaves 
of Tby [n). 

3.1 Estimating Rate and Frequency Parameters 

We discuss in this section the issues involved in estimating Q and its eigenvectors 
using data at the leaves. For the purposes of our algorithm we need only estimate 
the first left eigenvector and the second right eigenvector. Let n be the stationary 
distribution of Q (first left eigenvector) and denote II = diag(7r). Let 

be the right eigenvectors of Q corresponding respectively to eigenvalues 

= Ai > A 2 > . . . > X q . 

Because of the reversibility assumption, we can choose the eigenvectors to be 
orthonormal with respect to the inner product, 



3 GTR Model with Unknown Rate Matrix 




ie[ q ] 
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In the case of multiplicity of eigenvalues this description may not be unique. 

Proposition 5 There exists n(e, g, Q) such that given nlogn samples there exist 
estimators tx and v 2 such that 

||7r-7r||<e, (16) 

and 

2 = Y^^u l , (17) 



i=i 



where \ a 2 — 1 1 < e and \ ^\ < Q far I > 3, (for some choice of v if the second 
eigenvalue has multiplicity greater than 1 ). 

Estimates Let F denote the empirical joint distribution at leaves a and b as a 
q x q matrix. (We use an extra sample block for this estimation.) To estimate tx and 
v 2 , our first task is to find two leaves that are sufficiently close to allow accurate 
estimation. Let a*,b* G [n] be two leaves with minimum log-det distance 

(a*, b*) G argmin j-logdetF afe : (a, b) G [n] x 



Let 



F = F a * b * , 



and consider the symmetrized correlation matrix 

p\ _ I(_p a * b * 4- (F a * 6 *) T ). 
2 

Then we estimate tx from 

v'e[q] 

for all v G [q]. Denote II = diag(7r). By construction, ft is a probability distribu- 
tion. Let (p = r(a*, b*) and define G to be the symmetric matrix 

G = n-^Frr 1 / 2 = n-^rie^n- 1 / 2 = n^e^n- 1 / 2 . 

Then denote the right eigenvectors of G as 

/i 1 = n 1 / 2 */ 1 , /i 2 = n^V, . . . , = n^V, 
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with corresponding eigenvalues 

1 = 0« = e ^ Al > 0, (2 ! ... = e^ Aa > • • • > 0, (<? ! ... = 

(a*,o*) (a*,o*) — — (a*,o*) ' 

(2) f 

orthonormal with respect to the Euclidean inner product. Note that 9)1 bm s < e~ J 
and that u 1 is the all-one vector. Assuming % > 0, define 

g = n-v^n-va. 

which we use to estimate the eigenvectors and eigenvalues of Q. Since G is real 
symmetric, it has q real eigenvalues 0^ > 9^ 2 ' > ... > 9^ q \ with a corresponding 
orthonormal basis p 1 , p 2 , . . . , fjfl. It can be checked that, provided G > 0, we have 

1 = 0(i) > 0(2). We use 

V = 11 ' p, . 

as our estimate of the "second eigenvector" and 9^ as our estimate of the second 
eigenvalue of the channel. 

Discussion The sensitivity of eigenvectors is somewhat delicate [HJ85]. With 
sufficiently many samples (k = k log n for large enough k) the estimator G will 
approximate G within any constant tolerance. When the second eigenvalue is dis- 
tinct from the third one our estimate will satisfy (17) provided k is large enough. 

If there are multiple second eigenvectors the vector v 2 may not exactly be 
an estimate of v 2 since indeed the second eigenvalue is not uniquely defined: 
using classical results (see e.g. [GVL96]) it can be shown that z> 2 is close to a 
combination of eigenvectors with eigenvalues equal to 9^ 2 \ Possibly after passing 
to a different basis of eigenvectors u 1 , v 2 , . . . , u q , we still have that equation (17) 
holds. By standard large deviations estimate this procedure satisfies Proposition 5 
when k is large enough. 

Remark 4 This procedure provides arbitrary accuracy as k grows, however, for 
fixed k it will not in general go to as n goes to infinity as the choice of a* , b* 
may bias the result. An error of size 0(1/ y/k) may be obtained by taking all pairs 
with log-det distance below some small threshold (say Ag), randomly picking such 
a pair a', V and estimating the matrix G using a', b'. 

We could also have estimated tt by taking the empirical distribution of the 
states at one of the vertices or indeed the empirical distribution over all vertices. 
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3.2 Recursive Linear Estimator 

As in the Gaussian case, we build a recursive linear estimator. We use notation 
similar to Section 2. Let K = nlogn be the size of each block. We let Z v 
be a generic sample from the GRT model independent of everything else, and we 
define X u = v\ u for all u E V. We let (Z^)^ be a block of independent samples 
at the leaves, and we set X* = v\ t , for all u E V and £ = 1, . . . , K. For a node 
u E V, we let L^J be the leaves below u and X\ u \ , the corresponding state. Let 
< 5 < 1 (small) and c > 1 (big) be constants to be defined later. 

Linear estimator We build a linear estimator for each of the vertices recursively 
from the leaves. Let x E V— [n] with children (direct descendants) yi, 1/2- Assume 
that the topology of the tree rooted at x has been correctly reconstructed. Assume 
further that we have constructed linear estimators 

of X u , for all u E V below x. We use the convention that 

if u is a leaf. We let C x be a linear combination of the form 

S x = L x {Xy x \) = u yi £ yi (X\_ yii ) + Lo y2 £y 2 (X\_y 2 j), (18) 
where the lu's are chosen below. 

Recursive conditions Assume that we have linear estimators C u for all u below 
x satisfying 

E[S K \Z u ] = Y,B l M u , (19) 
1=1 

for some B l {u) such that \B 2 {u) — 1| < 6 and \B l (u)/B 2 (u)\ < gfor I — 3, . . . , q. 
Note that no condition is placed onS^w). Further for all i E [q] 

r u (C) <(EiS u \Z u = i]+c( 2 , (20) 

where as before 

rJ,(C) = lnE[exp(CS , „)|^„ = i]. 
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Observe that these conditions are satisfied at the leaves. Indeed, for u E [n] one 

has S u = v\ u = YjLi a i u z u and therefore E l s u I Z u ] = J2 q i a i u z u and K(0 = 
C^[S U I Z u — i]. We now seek to construct S x so that it in turn satisfies the same 
conditions. 

Moreover we assume we have a priori estimated edge weights f e for all e 
below x such that for e > we have that 



Let 9 e = e" fe . 

First eigenvalue adjustment As discussed above, because we cannot estimate 
exactly the second eigenvector, our estimate u 2 may contain components of other 
eigenvectors. While eigenvectors v z through v q have smaller eigenvalues and will 
thus decay in importance as we recursively construct our estimator, the presence 
of a component in the direction of the first eigenvalue poses greater difficulties. 
However, we note that u 1 is identically 1. So to remove the effect of the first 
eigenvalue from equation (19) we subtract the empirical mean of S u , 



As (n, v l ) = for / = 2, . . . , q and u 1 = 1 we have that E,S U = from 
(19) and hence the following proposition follows from standard large deviations 
estimates. 

Proposition 6 (Concentration of Empirical Mean) For u e V, e' > and 7 > 

0, suppose that conditions (19) and (20) hold for some 5, e and c. Then there exists 
k = k(s', c, 7, 5, s) > such that, when we have K > k log n then 



with probability at least 1 — 0{n 7 ). 

Proof: Let > 0. By Chernoff's bound, of the K samples, Ki are such that 
Zi = i where 



r e - r e | < e. 



(21) 




S u -B\u)\<e', 



u 
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except with inverse polynomial probability, given that k is large enough. By (19) 
and (20), we have 



where 

\^[{S u -B\u))\Z u = i\ 



1=2 



< (1 + S)(l + qg) max I/^/ttJ = T. 



Let e-p > 0. Choosing ( = |£ in Markov's inequality for e^ Su ~ B ^ gives that the 
average of — B l (u) over the samples with Z^ = i is within e r of Ylt=2 ^ l ( u ) l 'i 
except with probability at most e - £ r K (^~ £ ^)/ ic = l/poly(n) for k large enough 
and e w small enough. Therefore, in that case, 



i=l 



< qe v + e n [T + e r ] < e , 



for e n , €r small enough, where we used (it, u l ) — for I — 2, . . . , q. ■ 

For a = 1,2, using the Markov property we have the following important 
conditional moment identity which we will use to relate the bias at y a to the bias 
at x, 

E (Si - B\y a ) | Z x = i) ^J2J2 Bl (y-) M tH 



--2 j=l 



(22) 



where we used the fact that the v h s are eigenvectors of M^ a with eigenvectors 



9^ = exp(-XiT e ). 



Procedure We first define a procedure for estimating the path length (that is, the 
sum of edge weights) between a pair of vertices u\ and u 2 including the bias. For 
ui,u 2 £ V with common ancestor v we define 

(\ K 

f (tix, «a) = -In - K - S U1 ) (S C U2 - S U2 ) 
\ i=\ 
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This estimator differs from Section 2.1 in that we subtract the empirical means to 
remove the effect of the first eigenvalue. Using the fact that Yle=i &i ~ S Ul = 
and Proposition 6 we have that with probability at least 1 — 0(n~"') 

1 K 



< 



+ (5J 1 -B 1 («i)) (S e a2 -B\u 2 )) 



£=1 



and similarly the other direction so, 

1 K 

-j7 ^2 (^tl ~ (^2 ~~ ^2) 



< e 



/\2 



(23) 



It follows that f(ui,u 2 ) is an estimate of the length between U\ and u 2 including 
bias since 

^[(Si-B'im)) (Si 2 -B\u 2 ))} 

= ]T « - B 1 ^) I Z v = i) E (S^ - B 1 ^) I Z„ = i) 



ie[q] 



>4 



(0 ^ 



. Z=2 



. Z=2 



^wC) fla wC)+°(e) 



fi 2 («i)S 2 (n 2 )e^ T(ni ' n2) 



(24) 



where line 2 follows from equation (22). Above we also used the recursive as- 
sumptions and the fact that Yliek] ^ii^i) 2 = 1- We w ^ use me estimator t(u, v) 
to estimate (3(u) = — lnB 2 (u). Given the previous setup, we choose the weights 
u Va , a = 1,2, as follows: 
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1. Estimating the Biases. If 2/1,2/2 are leaves, we let (3(y a ) = 0, a = 1,2. 
Otherwise, let z a ±, z a2 be the children of y a . We then compute 

And similarly for y 2 . Let B 2 (y a ) = e~ p{ - Va \ a = 1, 2. 

2. Minimizing the Variance. Set io ya ,a = 1, 2 as 

W (25) 

the solution of the following optimization problem: 

min{a4+a4 : u yi &{y x )6?} +ui ya &(y i )e® = 1, u yi ,u y2 > 0}. (26) 

The constraint above guarantees that the bias condition (19) is satisfied 
when we set 

Bias and Exponential Moment We now prove (19) and (20) recursively as- 
suming (21) is satisfied. Assume the setup of the previous paragraph. We already 
argued that (19) and (20) are satisfied at the leaves. Assume further that they are 
satisfied for all descendants of x. We first show that the f -quantities are concen- 
trated. 

Proposition 7 (Concentration of Internal Distance Estimates) For all e > 0, 

7>0, < 5 < 1 and c > 0, there are k = k(e, 7, S, c) > 0, g — g(e, 7, 5, c) > 
such that, with probability at least 1 — 0(n~ 7 ), we have 

\f(u, v) - (t(u, v) + p{u) + 0{v))\ < e, 

for all u, v G {y±, y 2 , zn, z\ 2) Z21, ^22} where z a i, z a2 are the children ofy a . 

Proof: This proposition is proved similarly to Proposition 1 by establishing con- 
centration of j(Ylt=iStSvi where = Si — B x {u), around its mean which 
is approximately e^ T ^ u ' v ^^ u '~^ v ' by equation (24). The only difference with 
Proposition 1 is that, in this non-Gaussian case, we must estimate the exponential 
moment directly using (20). We use an argument of [PR1 1, RoclO]. 
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Let C > 0. Let N be a standard normal. Using that E[e aN ] = e a2/2 and 
applying (19) and (20), 

E[e^\Z {uM ] < E[e^ E ^ z ^ +c ^ 2 \Z {UtV} ] 
= E[e^ uE ^ z ^ +y/ ^^ uN \Z{ uv j} 

< ^ e (( E lSv\Z v ]+V^N)E[Su\Z u }+c((;E[S v \Z v }+ v ^<:N) 2 ^ z ^ j 

We factor out the constant term and apply Cauchy-Schwarz on the linear and 
quadratic terms in N 

E[e^\Z {uM ] 

< e tnSu\Zv]ms v \z v ] e cC 2 r 2 ^ e 4c?c 2 N 2 n/2 

vW \ p 2{V2E(E[Su\Z u ]+2cV2^(: 2 E[S v \Z v ])N\ 7r 1 1/2 

< CErS,|Z„lE[S P |Z P l r cC 2 Y 2 1 2 C T 2 C 2 (l+2cC) 2 

(l-8c 2 C 2 ) 1/4 
= 1 + (E[S u S v \Z {uA \ + T'C 2 + 0(C 3 ), 

as C — )• 0, where T was defined in the proof of Proposition 6 and T' > is a 
constant depending on T and c. Taking expectations and expanding 

e -C(E[&&]4*) E [ e C2«&] = l _ £ £ + T ^2 + ^3) < ^ 

for £ small enough, independently of n. Applying Markov's inequality gives the 
result. ■ 

Proposition 8 (Recursive Linear Estimator: Bias) Assuming (19), (20), and (21 ) 
hold for some e > that is small enough, we have 



E[S x \Z x ) = ^B l (x)v 



i 



for some B l (x) such that \B 2 (x) — 1| < 8 and \B l (x) / B 2 (x)\ < gforl — 3,...,q. 

Proof: We first show that the biases at yi, y 2 are accurately estimated. Applying 
a similar proof to that of Proposition 2 (using Proposition 7 in place of Proposi- 
tion 1) we have that 

\P(yi) - P(Vi)\ <0(e + g). 



32 



The same inequality holds for y 2 . Taking e, 5 small enough, our previous bounds 
on B, 9 and their estimates, we derive from equation (25) that u Va = 6(1), a = 
1, 2 with high probability. We now calculate the bias at x to be, 

E[S X \Z x = i] = E[u yi S yi + ojy 2 Sy 2 \Z x = i] 

a=l,2 1=1 

{=i 

f=i 



where we used equation (22) on line 2. Observe that since u yi ,u y2 are positive 

and < eg < eg 5 for Z > 3, 



B l (x) 



B 2 (x) 



<:'2 



< 



3 (2) 



(2) 



w Wie B 2 (i/ 1 )^+w we B 2 (y 2 )ft 



(2) 



u yi B 2 ( yi )e^+u y2 B 2 (y 2 )e, 



0- 



Applying the bounds on u; ya and /3(y a ) for a = 1, 2 we have that 



B 2 (x) 



(2) 



ei 



!)2 



< aj yi e-^ + °^ +s \eW + 0(e + g )) 

+u ya e-?to)+o(e+e)0W + (e + g)) 



B 2 

= l + 0(e + g). 
Choosing e and p small enough, it satisfies \B 2 (x 



II < 5. 



Proposition 9 (Recursive Linear Estimator: Exponential Bound) There is c > 
such that, assuming (19), (20), and (21) hold, we have for all i e [q] 

r x (()<cE{s x \z x = t] + c( 2 . 
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Proof: We use the following lemma suitably generalized from [PR1 1, RoclO]. 
Lemma 1 (Recursion Step) Let M = e TQ as above with eigenvectors 

with corresponding eigenvalues 1 = e Al > . . . > e Xq . Let 62, • • • , b q we arbitrary 
constants with < 2. Then there is d > depending on Q such that for all 
% E [q] 

F(x) = ^ Mijexp ( x^bii/j ] < exp (x^Xfiiul + c'x 2 j = G(x), 

je[q] \ 1=2 J \ 1=2 J 

for all i£l 
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We have by the Markov property and Lemma 1 above, 



rj,(0 = InE 



exp 



C E Sy^y* ) I z * 

\ Q=l,2 / 



E lnE t ex P (CSy a uJ ya ) \Z x = i] 



0=1,2 



o=l,2 \je[ff] 

= EME M ^ ex p( r L(c^J) 

< E ln ( E ex P tt"vMSy a I = j] + c(W y J 

a=l,2 Vje[ff] 



< 2 E < + E ln E M " ex p ( c«*. E 

Q =l,2 0=1,2 ye [9] 

< 2 E<+cE^«> 



1=1 



0=1,2 



Q = l,2 



+ E ln ( E M % exp ( E 

0=1,2 y ie[ g] \ i=2 

< < 2 E < + c E <** i&ivM + E c 'c 2 < 



Q = l,2 



Q = l,2 1 = 1 



0=1,2 



CE^lZ^z] +C 2 (c + c') ^ 



tin 



o=l,2 
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Take c large enough so that c + d < c(l + e') for some small e' > 0. Moreover, 
from (25) 

(l + 0(e + S + g)) 



0\ + oi 



< ^{1 + 0(6 + 5 + q))<1, 
where 9* = e~ 9 so that 2(6*) 2 > 1. Hence, 

r^(C)<CE[^|z x = z] + c c 2 . 



4 Concluding remarks 

We have shown how to reconstruct latent tree Gaussian and GTR models using 
0(log 2 n) samples in the KS regime. In contrast, a straightforward application of 
previous techniques 0(log 3 n) samples. Several questions arise from our work: 

• Can this reconstruction be done using only O(logn) samples? Indeed this 
is the case for the CFN model [Mos04] and it is natural to conjecture that it 
may be true more generally. However our current techniques are limited by 
our need to use fresh samples on each level of the tree to avoid unwanted 
correlations between coefficients and samples in the recursive conditions. 

• Do our techniques extend to general trees? The boosted algorithm used 
here has been generalized to non-homogeneous trees using a combinatorial 
algorithm of [DMRlla] (where edge weights are discretized to avoid the 
robustness issues considered in this paper). However general trees have, 
in the worst case, linear diameters. To apply our results, one would need 
to control the depth of the subtrees used for root-state estimation in the 
combinatorial algorithm. We leave this extension for future work. 
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